KEP-5593: Configure the max CrashLoopBackOff delay #5594
Conversation
hankfreund commented Sep 30, 2025
- One-line PR description: Splitting KEP-4603: Tune CrashLoopBackoff into two KEPs.
- Issue link: Configure the max CrashLoopBackOff delay #5593
- Other comments: No material changes have been made to either KEP; content was removed from one or the other, and grammar was updated to make sense.
Welcome @hankfreund!
Hi @hankfreund. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/ok-to-test
Need to add a prod readiness file: keps/prod-readiness/sig-node/5593.yaml
Force-pushed from cacae2d to 511963e.
#### Beta

- Gather feedback from developers and surveys
Do we have any feedback? I'm not sure we want to block the beta on this.
Removed.
will rollout across nodes.
-->

<<[UNRESOLVED beta]>> Fill out when targeting beta to a release. <<[/UNRESOLVED]>>
If we're targeting beta this release, this needs to be filled out. Or were you planning to cover this in a follow-up PR?
The risk is that a configured crashloop backoff causes the kubelet to become unstable. If that happens, rollback just requires updating the config and restarting the kubelet.
I wasn't sure initially if I should do it all in one, but it makes sense. Updated this and all the following sections.
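For readers arriving from the split, a minimal sketch of the knob under discussion, assuming the field and gate names from KEP-4603's alpha implementation (`maxContainerRestartPeriod` in KubeletConfiguration behind the `KubeletCrashLoopBackOffMax` gate); treat both names as illustrative here. Rollback is reverting this field, or disabling the gate, and restarting the kubelet.

```go
// Sketch of the KubeletConfiguration addition this KEP graduates; the
// struct and field names assume KEP-4603's alpha and are illustrative.
package config

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

type CrashLoopBackOffConfig struct {
	// MaxContainerRestartPeriod caps the exponential CrashLoopBackOff
	// delay for every container on the node; the KEP allows lowering it
	// from the 300s default to a 1s minimum. Leaving it unset keeps the
	// default curve, which is also the rollback path: remove the field
	// from the kubelet config and restart the kubelet.
	MaxContainerRestartPeriod *metav1.Duration `json:"maxContainerRestartPeriod,omitempty"`
}
```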
@@ -0,0 +1,3 @@
kep-number: 5593
alpha:
  approver: TBD
This was already approved for alpha. You're just splitting the previous enhancement into 2 parts. I think you can put @soltysh here (copied from https://github.com/kubernetes/enhancements/blob/511963e97f955f97e9842ae3015b60af956539b3/keps/prod-readiness/sig-node/4603.yaml)
* `kubelet_pod_start_sli_duration_seconds`

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? |
No, but I think you can just put N/A here. This feature is stateless.
I can agree with the stateless fact, but I need those feature on/off tests linked in the previous sections. Then update this section mentioning that b/c it's stateless it's sufficient to verify that turning the feature gate on and off works as expected.
Force-pushed from 511963e to d7ce106.
A few comments focused on clean separation between the two KEPs
(Success) and the pod is transitioned into a "Completed" state or the expected
length of the pod run is less than 10 minutes.

This KEP proposes the following changes:
nit:
- This KEP proposes the following changes:
+ This KEP proposes the following change:
Done.
Some observations and analysis were made to quantify these risks going into
alpha. In the [Kubelet Overhead Analysis](#kubelet-overhead-analysis), the code
paths all restarting pods go through result in 5 obvious `/pods` API server
I can't comment directly because it's out of range of the diff, but lines 499-508 still contain references to the per Node feature.
I went through and I think I got all of them.
included in the `config.validation_test` package.
### Rollout, Upgrade and Rollback Planning
On and after line 1149 (ref) in the Scalability section, the per Node feature is referenced; I suggest linking out to the other KEP there for inline context.
Done.
This KEP proposes the following changes:
* Provide a knob to cluster operators to configure maximum backoff down, to
  minimum 1s, at the node level
* Formally split image pull backoff and container restart backoff behavior
Should this bullet point still be in the other KEP as well since it was also done alongside that (or alternatively, should this bullet point be taken out of the top level content for both)? I think it was important to include references to these refactorings above the fold during the alpha phase so it was clear what was happening, but less important now that the alpha is implemented
I think removing it is all right.
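For context on the first bullet, a minimal sketch of the arithmetic the knob caps, assuming the kubelet's historical defaults (10s initial delay, doubling each restart, 300s ceiling); the function is illustrative, not the kubelet's actual code:

```go
package main

import (
	"fmt"
	"time"
)

// crashLoopDelay is illustrative only: the 10s initial delay doubling
// per restart reflects the kubelet's historical defaults, and maxBackoff
// is the per-node cap this KEP lets operators lower (to a 1s minimum).
func crashLoopDelay(restarts int, maxBackoff time.Duration) time.Duration {
	delay := 10 * time.Second
	for i := 0; i < restarts && delay < maxBackoff; i++ {
		delay *= 2
	}
	if delay > maxBackoff {
		delay = maxBackoff
	}
	return delay
}

func main() {
	// With the 300s default a 5th restart waits 300s; with the knob set
	// to 60s it waits 60s; at the 1s floor it restarts almost immediately.
	for _, max := range []time.Duration{300 * time.Second, 60 * time.Second, time.Second} {
		fmt.Printf("cap=%v delay=%v\n", max, crashLoopDelay(5, max))
	}
}
```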
rate limiting made up the gap to the stability of the system. Therefore, to
simplify both the implementation and the API surface, this 1.32 proposal puts
forth that the opt-in will be configured per node via kubelet configuration.
Now that this is the only feature referred to in this KEP, I feel like this section would read better with a subheading here like `### Implementing with KubeletConfiguration` or something. Before it was all smooshed together since there were already so many H3s lol, but that's not the case anymore.
Done.
All behavior changes are local to the kubelet component and its start up
configuration, so a mix of different (or unset) max backoff durations will not
cause issues.
Just noticed that this sentence is kinda vague.
- cause issues.
+ cause issues to running workloads.
Done.
* Formally split backoff counter reset threshold for container restart backoff
  behavior and maintain the current 10 minute recovery threshold
* Provide an alpha-gated change to get feedback and periodic scalability tests
  on changes to the global initial backoff to 1s and maximum backoff to 1 minute
Consider adding a sentence to both Overviews about how this was originally a bigger KEP that has been split out into two, and link to the other one there, so its quickly in context for new or returning readers.
Done.
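As a side note on the 10 minute recovery threshold in the hunk above, a minimal sketch of the reset rule it describes; the package and function names are illustrative, not the kubelet's actual code:

```go
package kuberuntime

import "time"

// Illustrative only: the kubelet's real reset logic lives in its
// container runtime manager. A container that stayed up at least the
// recovery threshold is treated as recovered, so its backoff counter
// (and therefore its next CrashLoopBackOff delay) starts over.
const recoveryThreshold = 10 * time.Minute

func shouldResetBackoff(lastRunDuration time.Duration) bool {
	return lastRunDuration >= recoveryThreshold
}
```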
* Formally split image pull backoff and container restart backoff behavior
* Formally split backoff counter reset threshold for container restart backoff
  behavior and maintain the current 10 minute recovery threshold
X-post from the other one: Consider adding a section to both Overviews about how this was originally a bigger KEP that has been split out into two, and link to the other one there, so its quickly in context for new or returning readers.
Done.
Force-pushed from 1d0be4d to 684ee23.
reviewers:
- "@tallclair"
approvers:
- TBD
@mrunalp can you take this?
sure
@hankfreund please add @mrunalp here
Done!
question.
-->

<<[UNRESOLVED beta]>> Fill out when targeting beta to a release. <<[/UNRESOLVED]>>
This needs to be filled out.
same as below
Got it. I think all the required sections are filled out now.
From the sig-node perspective, this is just a copy of what we got to alpha already. Pretty straightforward.
A couple of PRR questions are unanswered.
Force-pushed from 684ee23 to 6105155.
/assign @mrunalp
/lgtm (except approver needs to be listed)
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: hankfreund, mrunalp
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Tuning and benchmarking a new crashloopbackoff decay will take a lot of work. In the meantime, everyone can benefit from a per-node configurable max crashloopbackoff delay. Splitting the KEP into two KEPs to allow for graduating the latter to beta before the former.
This KEP is mostly a copy of keps/sig-node/4603-tune-crashloopbackoff with all the tuning bits removed (and grammar adjusted to make sense). The desire is to advance this KEP to beta sooner than we'd be able to advance the other one.
Force-pushed from 6105155 to e181610.
From a PRR pov mostly missing test links.
# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled
feature-gates:
- name: ReduceDefaultCrashLoopBackoffDecay
Nit: but above there's a see-also section you could update to point to the other KEP:
see-also:
  - "/keps/sig-node/5593-configure-the-max-crashloopbackoff-delay"
title: Configure the max CrashLoopBackOff delay
kep-number: 5593
authors:
- "@lauralorenz"
Nit: I'm assuming a lot is copied from the other KEP, but still I'd add hankfreund here.
- [ ] (R) Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Nit: please make sure to update this checklist, ✔️ the appropriate ones.
[testgrid](https://testgrid.k8s.io/sig-testing-canaries#pull-kubernetes-integration-go-canary),
[latest
prow](https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-integration-go-canary/1710565150676750336)
* test with and without feature flags enabled
Can you update these links so they point to the exact tests verifying feature gate on and off? Are there any additional integration tests for this feature, if yes please provide here the necessary links.
We expect no non-infra related flakes in the last month as a GA graduation criteria.
-->

- Crashlooping container that restarts some number of times (ex 10 times),
In the graduation criteria below you're mentioning e2e for alpha, can you update this section with appropriate links to specific tests?
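For whoever wires up those links, a hedged sketch of the kind of fixture such an e2e test typically uses (not the actual test from this KEP): a pod whose container exits nonzero so the kubelet drives it through repeated CrashLoopBackOff restarts.

```go
package e2enode

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// crashLoopPod returns a pod whose container exits nonzero right away,
// so restartPolicy Always drives it through repeated CrashLoopBackOff
// restarts. A test would poll Status.ContainerStatuses[0].RestartCount
// until it reaches the target (e.g. 10) and assert on observed delays.
func crashLoopPod() *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "crashloop-backoff-test"},
		Spec: corev1.PodSpec{
			RestartPolicy: corev1.RestartPolicyAlways,
			Containers: []corev1.Container{{
				Name:    "crasher",
				Image:   "busybox",
				Command: []string{"sh", "-c", "exit 1"},
			}},
		},
	}
}
```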
No coordination needs to be done between the control plane and the nodes; all
behavior changes are local to the kubelet component and its start up
configuration. An n-3 kube-proxy, n-1kube-controller-manager, or n-1
- configuration. An n-3 kube-proxy, n-1kube-controller-manager, or n-1
+ configuration. An n-3 kube-proxy, n-1 kube-controller-manager, or n-1
and discussions with other contributors indicate that while little in core
kubernetes does strict parsing, it's not well tested. At minimum as part of this
implementation a test covering this for `KubeletConfiguration` objects will be
included in the `config.validation_test` package.
Since this seems copied from the other KEP, were those tests added, if so can you link them here?
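For reference while chasing those links, a hypothetical sketch of what such a table-driven validation case could look like; the helper name and the 1s floor are assumptions drawn from this KEP's stated minimum, not the actual test file:

```go
package config_test

import (
	"fmt"
	"testing"
	"time"
)

// validateMaxContainerRestartPeriod stands in for the real
// KubeletConfiguration validation; the 1s floor mirrors the KEP's
// stated minimum for the configurable max backoff.
func validateMaxContainerRestartPeriod(d time.Duration) error {
	if d < time.Second {
		return fmt.Errorf("maxContainerRestartPeriod %v is below the 1s minimum", d)
	}
	return nil
}

func TestCrashLoopBackOffValidation(t *testing.T) {
	for _, tc := range []struct {
		period  time.Duration
		wantErr bool
	}{
		{500 * time.Millisecond, true}, // below the 1s floor: rejected
		{time.Second, false},           // exactly the minimum: allowed
		{300 * time.Second, false},     // historical default cap: allowed
	} {
		err := validateMaxContainerRestartPeriod(tc.period)
		if (err != nil) != tc.wantErr {
			t.Errorf("period %v: got err=%v, wantErr=%v", tc.period, err, tc.wantErr)
		}
	}
}
```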
implementation difficulties, etc.).
-->

N/A
No
will be better answer.
-->

Maybe! As containers could be restarting more, this may affect "Startup latency
of schedulable stateless pods", "Startup latency of schedule stateful pods".
Have you performed any measurements as to how significant that degradation can be? Similar to how you provided rough estimations for increased CPU usage in the next question.